In this project, we explore the physicochemical properties of white wine and how they relate to its quality. The text file is located here with more information regarding the physicochemical variables.
First, we will load the dataset into R and examine its features.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
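The structure and summary above could be produced along the following lines. This is only a minimal sketch: the file name `wineQualityWhites.csv` and the helper `load_wine` are assumptions for illustration, not names taken from the original analysis.

```r
# Hypothetical helper: read the white wine CSV into a data frame.
load_wine <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

# Hypothetical file name; replace with the actual path to the dataset.
wine_path <- "wineQualityWhites.csv"

if (file.exists(wine_path)) {
  df <- load_wine(wine_path)
  str(df)       # column types and the first few values of each variable
  summary(df)   # min, quartiles, mean, and max for each variable
}
```

Note that `X` is only a row index, so it can be ignored in later analysis.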
free.sulfur.dioxide and total.sulfur.dioxide are discrete whereas all other input variables are continuous.
quality is approximately normally distributed, with 6 as the most frequent value.
It seems that most of our input variables, such as chlorides and residual.sugar, contain outliers. I decided to rescale the axes and determine whether each distribution is normal or skewed. Some observations on the distributions of the chemical properties can be made:
Positive-Skewed + Outliers : residual.sugar, free.sulfur.dioxide, chlorides, volatile.acidity
library(ggplot2)

# Histogram of a variable on its original scale
get_ori_plot <- function(var, label) {
  ggplot(aes(x = (var)), data = df) + xlab(label) +
    geom_histogram(colour = "black",
                   fill = '#dbdd46',
                   bins = 30)
}
# Histogram of the square-root-transformed variable
get_sqrt_plot <- function(var, label) {
  ggplot(aes(x = sqrt(var)), data = df) + xlab(label) +
    geom_histogram(colour = "black",
                   fill = '#dbdd46',
                   bins = 30)
}
# Histogram of the log10-transformed variable (requires var > 0)
get_log_plot <- function(var, label) {
  ggplot(aes(x = log10(var)), data = df) + xlab(label) +
    geom_histogram(colour = "black",
                   fill = '#dbdd46',
                   bins = 30)
}
I decided to transform the positively skewed features to determine whether they would display a normal distribution afterwards. An example is shown below:
After transformation, sulphates displayed a distribution closer to normal, so I created a new variable to hold the transformed values.
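One way to make the comparison between the three scales less dependent on eyeballing histograms is to compute the sample skewness under each transformation. The sketch below is an illustration only: `skewness` and `compare_transforms` are hypothetical helpers, and the data are synthetic right-skewed values standing in for a column such as `df$chlorides`.

```r
# Sample skewness (third standardized moment); base R has no built-in for it.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Skewness of a strictly positive variable on the original, sqrt, and log10 scales.
compare_transforms <- function(x) {
  c(original = skewness(x),
    sqrt     = skewness(sqrt(x)),
    log10    = skewness(log10(x)))
}

# Synthetic right-skewed data (assumption: stands in for a real wine column).
set.seed(1)
x <- rlnorm(5000, meanlog = 0, sdlog = 0.5)
round(compare_transforms(x), 2)
```

Values closer to zero indicate a more symmetric distribution. Because `log10` is undefined at zero, a variable such as citric.acid (minimum 0.0000 in the summary) would need the square root or an offset instead.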
rating
It was shown that quality was approximately normally distributed, with 6 as the most frequent rating. I decided to create a categorical variable grade for future analysis with various features. In this project, we consider a rating of 3-4 as Bad, 5-7 as Average, and 8-9 as Good.
##
## Bad Average Good
## 183 4535 180
Most wines fall in the Average category, with comparatively few Bad and Good wines.
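The binning described above can be sketched with base R's `cut()`. The helper name `make_rating` is an assumption for illustration, and the small vector `q` is toy data rather than the wine dataset.

```r
# Map integer quality scores to Bad (3-4), Average (5-7), Good (8-9).
make_rating <- function(quality) {
  cut(quality,
      breaks = c(2, 4, 7, 9),                  # intervals (2,4], (4,7], (7,9]
      labels = c("Bad", "Average", "Good"))
}

# Toy scores covering the full 3-9 range seen in the summary.
q <- c(3, 4, 5, 6, 7, 8, 9)
table(make_rating(q))
```

Applied to the real data, this would look like `table(make_rating(df$quality))`. Scores outside 3-9 would become `NA`, which is a useful check that no unexpected values are present.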
free.to.bound
After examining the structure of our dataset, I decided to examine the relations between variables. I first looked at our discrete input features: free.sulfur.dioxide and total.sulfur.dioxide. It was shown that total.sulfur.dioxide is composed of free.sulfur.dioxide and bound.sulfur.dioxide. The ratio of free to bound sulfur dioxide seemed more appropriate than the ratio of free.sulfur.dioxide to total.sulfur.dioxide.
df$bound.sulfur.dioxide <-
df$total.sulfur.dioxide - df$free.sulfur.dioxide
df$free.to.bound <- df$free.sulfur.dioxide / df$bound.sulfur.dioxide
We got a quick overview of the distribution of each feature in our dataset. Our main interest, quality, was approximately normally distributed, with 6 as the most frequent value. It was also shown that all input features were continuous except the sulfur.dioxide features. Since some of our features were positively skewed, we created functions for transforming a feature toward a more nearly normal distribution. After researching total.sulfur.dioxide, we created a new feature, the ratio of free.sulfur.dioxide to bound.sulfur.dioxide, to analyze in later sections. We also saw that the majority of our features had outliers that would affect our future plots. Thus, rather than removing the outliers, we decided to rescale the plots to come.
I decided to examine the relation between the output and the input variables through Pearson's r coefficient.
## [,1]
## fixed.acidity -0.113662831
## volatile.acidity -0.194722969
## citric.acid -0.009209091
## residual.sugar -0.097576829
## chlorides -0.209934411
## free.sulfur.dioxide 0.008158067
## total.sulfur.dioxide -0.174737218
## density -0.307123313
## pH 0.099427246
## sulphates 0.053677877
## alcohol 0.435574715
## quality 1.000000000
## bound.sulfur.dioxide -0.217867760
## free.to.bound 0.164797933
## [,1]
## fixed.acidity FALSE
## volatile.acidity FALSE
## citric.acid FALSE
## residual.sugar FALSE
## chlorides TRUE
## free.sulfur.dioxide FALSE
## total.sulfur.dioxide FALSE
## density TRUE
## pH FALSE
## sulphates FALSE
## alcohol TRUE
## quality TRUE
## bound.sulfur.dioxide TRUE
## free.to.bound FALSE
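The two tables above (the coefficients and the TRUE/FALSE flags) can be reproduced in outline as follows. This is a sketch under assumptions: the helper `cor_with_target` is hypothetical, the 0.2 cutoff is a guess that happens to be consistent with the flags shown, and the data are synthetic.

```r
# Pearson's r between every numeric column and a target column,
# plus a logical flag for |r| above a chosen threshold.
cor_with_target <- function(data, target, threshold = 0.2) {
  num <- data[sapply(data, is.numeric)]
  r <- cor(num, data[[target]])        # one-column matrix; Pearson by default
  list(r = r, strong = abs(r) > threshold)
}

# Synthetic stand-in for the wine data (assumption, not the real dataset).
set.seed(42)
n <- 500
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
toy <- data.frame(
  alcohol = alcohol,
  pH      = rnorm(n, mean = 3.2, sd = 0.15),
  quality = 0.5 * alcohol + rnorm(n)
)

res <- cor_with_target(toy, "quality")
res$r
res$strong
```

The target column is flagged TRUE against itself (r = 1), which matches the `quality` row in the output above.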
From the correlation matrix, we can see notable correlations between:
alcohol and density
## [1] -0.7801376
density and residual.sugar
## [1] 0.8389665
total.sulfur.dioxide and density
## [1] 0.5298813
quality Correlation
I wanted to focus on the 5 features most strongly correlated with quality: alcohol, volatile.acidity, chlorides, density, and bound.sulfur.dioxide.
Before plotting, I wanted to examine the transformation among the variables of interest:
bound.sulfur.dioxide
bound.sulfur.dioxide displayed a normal distribution.
alcohol
alcohol displayed a normal distribution.
chlorides
chlorides displayed a better normal distribution.
volatile.acidity
volatile.acidity displayed a normal distribution.
Since our main focus was on the discrete variable quality, I decided to use boxplots to explore the correlated features:
# Function to make boxplots
# Assumes ggplot2 is loaded and that `colors` (a vector of fill colours,
# one per level of xvar) is defined elsewhere in the script.
make_boxplot <- function(xvar, yvar, title, xlabel, ylabel) {
  ggplot(df, aes(x = xvar, y = yvar, fill = xvar)) +
    geom_jitter(alpha = .3) +
    geom_boxplot(alpha = 0.5, color = 'blue') +
    # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
    stat_summary(fun = "mean",
                 geom = "point",
                 color = "red",
                 shape = 8,
                 size = 4) +
    scale_fill_manual(values = colors) +
    theme(legend.position = "none") +
    ggtitle(title) + theme(plot.title = element_text(hjust = 0.5)) +
    # Axis labels belong inside the function body so they are part of the plot
    xlab(xlabel) + ylab(ylabel)
}
alcohol % is shown to increase with quality and rating.
chlorides is shown to decrease with quality and rating.
density is shown to decrease with quality and rating.
bound.sulfur.dioxide is shown to decrease with quality and rating.
Based on our boxplots, it seems that higher quality wines tend to have:
higher alcohol
lower volatile.acidity
lower chlorides
lower density
lower bound.sulfur.dioxide
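The boxplot trends can also be checked numerically by summarising each feature within each rating group. The helper `summarise_by_group` below is hypothetical, and the values are toy numbers chosen only to show the shape of the output.

```r
# Median and mean of a numeric vector within each level of a grouping variable.
summarise_by_group <- function(values, groups) {
  groups <- factor(groups)
  data.frame(
    group  = levels(groups),
    median = as.numeric(tapply(values, groups, median)),
    mean   = as.numeric(tapply(values, groups, mean))
  )
}

# Toy data (assumption): alcohol rising from Bad to Good.
toy_alcohol <- c(9, 10, 11, 12, 13, 14)
toy_rating  <- factor(c("Bad", "Bad", "Average", "Average", "Good", "Good"),
                      levels = c("Bad", "Average", "Good"))
summarise_by_group(toy_alcohol, toy_rating)
```

On the real data, a call such as `summarise_by_group(df$alcohol, df$quality)` would give one row per quality score, which makes an increasing or decreasing trend easy to read off.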
free.to.bound
I also decided to explore the ratio of free.sulfur.dioxide to bound.sulfur.dioxide:
A higher ratio of free.sulfur.dioxide to bound.sulfur.dioxide tends to appear in higher quality wines.
Initially, we created a correlation matrix to determine which features were correlated. We saw that the overall strongest correlation is 0.839, between density and residual.sugar. Our main interest, however, was to determine the features that were correlated with quality. The strongest relation with quality was alcohol, with 0.436. After determining our main features, we categorized the quality into rating: Bad (3-4), Average (5-7), Good (8-9), and generated the boxplots to examine the differences across quality. We saw that higher alcohol together with lower volatile.acidity, chlorides, density, and bound.sulfur.dioxide appeared in higher quality wines.
In order to examine various features, I decided to create a correlation matrix based on the features that were highly correlated with quality and examine those features among each other.
## volatile.acidity chlorides density alcohol
## volatile.acidity FALSE FALSE FALSE FALSE
## chlorides FALSE FALSE FALSE TRUE
## density FALSE FALSE FALSE TRUE
## alcohol FALSE TRUE TRUE FALSE
## bound.sulfur.dioxide FALSE FALSE TRUE TRUE
## bound.sulfur.dioxide
## volatile.acidity FALSE
## chlorides FALSE
## density TRUE
## alcohol TRUE
## bound.sulfur.dioxide FALSE
##
## Bad Average Good
## 183 4535 180
Since Average accounted for 4535 of the observations, I decided to focus primarily on the Good and Bad wines.
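Restricting the data to the two extreme groups can be sketched as below. The helper `keep_extremes` and the column name `rating` are assumptions for illustration; the toy data frame is not the wine dataset.

```r
# Keep only rows rated Bad or Good and drop the now-empty Average level,
# so that plots and tables do not show an empty category.
keep_extremes <- function(data, rating_col = "rating") {
  out <- data[data[[rating_col]] %in% c("Bad", "Good"), , drop = FALSE]
  out[[rating_col]] <- droplevels(factor(out[[rating_col]]))
  out
}

# Toy data (assumption).
toy <- data.frame(
  rating  = factor(c("Bad", "Average", "Good", "Average"),
                   levels = c("Bad", "Average", "Good")),
  alcohol = c(9, 10, 12, 11)
)
keep_extremes(toy)
```

Dropping the unused level matters mostly for plotting, where an empty Average entry would otherwise still appear in legends and facets.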
free.sulfur.dioxide vs. bound.sulfur.dioxide
It seems that free.sulfur.dioxide is preferred to be between 25-50 mg/liter while bound.sulfur.dioxide is preferred to be between 50-100 mg/liter for higher quality wines. It is also shown that higher quality wines have a trend of increasing free.sulfur.dioxide.
density vs. bound.sulfur.dioxide
Comparing Bad and Good quality wines, it seems that Good wines are clustered in a tighter range than Bad wines. We can confirm that lower density and lower bound.sulfur.dioxide tend to appear in better wines.
alcohol vs. chlorides
Good wines tend to be more tightly clustered than Bad wines. The plots show that higher alcohol and lower chlorides are associated with higher quality wines.
The features used in the linear model were selected based on their correlation with quality:
##
## Call:
## lm(formula = quality ~ volatile.acidity + alcohol + chlorides +
## bound.sulfur.dioxide + density, data = df)
##
## Coefficients:
## (Intercept) volatile.acidity alcohol
## -3.714e+01 -2.026e+00 3.880e-01
## chlorides bound.sulfur.dioxide density
## -1.278e+00 -3.398e-04 3.983e+01
We can see that the equation for quality depends heavily on the density of the wine, although alcohol was the variable most strongly correlated with quality. I speculate that this is because density is also correlated with other features, such as bound.sulfur.dioxide and chlorides.
cor(df[,c(2:13,16)], df$density)
## [,1]
## fixed.acidity 0.26533101
## volatile.acidity 0.02711385
## citric.acid 0.14950257
## residual.sugar 0.83896645
## chlorides 0.25721132
## free.sulfur.dioxide 0.29421041
## total.sulfur.dioxide 0.52988132
## density 1.00000000
## pH -0.09359149
## sulphates 0.07449315
## alcohol -0.78013762
## quality -0.30712331
## free.to.bound -0.07921315
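When reading the linear model above, it may help to remember that raw coefficients are expressed per unit of each predictor, and the predictors sit on very different scales: density ranges only from 0.9871 to 1.0390 in the summary, while alcohol ranges from 8.00 to 14.20. One common way to put coefficients on a comparable footing is to standardize every variable before fitting. The sketch below illustrates that idea on synthetic data; `standardized_coefs` is a hypothetical helper, and the numbers are not results from the wine dataset.

```r
# Fit a linear model after scaling every variable to mean 0 and sd 1,
# and return the slope coefficients (intercept dropped).
standardized_coefs <- function(formula, data) {
  vars <- all.vars(formula)
  scaled <- as.data.frame(scale(data[vars]))
  coef(lm(formula, data = scaled))[-1]
}

# Synthetic data (assumption): quality depends on alcohol only,
# while density is strongly correlated with alcohol and has a tiny range.
set.seed(7)
n <- 1000
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 0.994 - 0.002 * (alcohol - 10.5) + rnorm(n, sd = 0.001)
quality <- 0.4 * alcohol + rnorm(n)
toy <- data.frame(quality, alcohol, density)

coef(lm(quality ~ alcohol + density, data = toy))          # raw scale
standardized_coefs(quality ~ alcohol + density, toy)       # comparable scale
```

On the real data, the analogous call would use the same formula as the model above, for example `standardized_coefs(quality ~ volatile.acidity + alcohol + chlorides + bound.sulfur.dioxide + density, df)`.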
We saw that the Average rating contained 4535 observations. We focused particularly on Good and Bad wines, which account for roughly 180 observations each. For this analysis, we examined the relations among the features that are themselves correlated with quality. The most interesting observation was that Good wines tended to be more tightly clustered than Bad wines. We were also able to confirm, using scatterplots, the boxplot trends seen in our bivariate analysis. With the linear model, we saw that density had the largest coefficient when predicting quality.
The correlation matrix allowed us to observe which features matter most for our main interest, quality. It also allowed us to explore, in the multivariate analysis, relations among variables other than their correlation with quality itself. We were able to see that the feature most correlated with quality was alcohol, followed by density.
alcohol vs. quality
We saw that alcohol had the highest correlation with quality, at 0.4355. Higher alcohol % by volume appeared in wines that scored higher.
alcohol vs. chlorides
In our multivariate analysis, we saw that Good wines tended to be more tightly clustered than Bad wines, as in the plot above. We were also able to confirm that higher alcohol and lower chlorides tend to be preferred in Good wines, which was also observed in our bivariate analysis.
The white wine dataset contained 4898 observations with 11 chemical properties. After exploring the dataset, we identified the main factors associated with wine quality:
alcohol
volatile.acidity
chlorides
density
bound.sulfur.dioxide
However, quality is subjective and we cannot base quality solely on physicochemical properties. There are other properties not included in the dataset that could play a bigger role in quality. Through our various plots, we were able to get a view of how a wine is rated based solely on physicochemical properties.
Throughout the project, we saw that there are many outliers that affected the initial distribution of the data and that not all features were normally distributed. Thus rescaling and transformation were necessary in later plots. In this project, we were able to transform the features appropriately to determine the best fit for our plots.
If possible, more Bad and Good wine data would give us a better understanding of wine quality.